import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
sns.set(style = 'darkgrid')
'''Just exploring what's in the data set.
Data set can be found here: https://data.cityofchicago.org/Transportation/CTA-Ridership-L-Station-Entries-Daily-Totals/5neh-572f'''
ridership = pd.read_csv('CTA_-_Ridership_-__L__Station_Entries_-_Daily_Totals.csv')
print(ridership.describe())
print(ridership.corr())
print(ridership.head())
Which stop has the highest average ridership per day, and what is it?
'''Average rides barchart'''
ridership.groupby('stationname').rides.mean().sort_values(ascending = True).plot(kind = 'barh', figsize = (25,25))
plt.title('Average rides per station')
plt.show()
plt.clf()
'''What is the average number of rides for each daytype?'''
ridership_w = ridership[ridership.daytype == 'W']
print(ridership_w.groupby('stationname').rides.mean().sort_values(ascending = False).head())
ridership_a = ridership[ridership.daytype == 'A']
print(ridership_a.groupby('stationname').rides.mean().sort_values(ascending = False).head())
ridership_u = ridership[ridership.daytype == 'U']
print(ridership_u.groupby('stationname').rides.mean().sort_values(ascending = False).head())
The highest average ridership per day, both overall and on weekdays, belongs to Clark/Lake.
However, on Saturdays it is Chicago/State, and on Sundays it is O'Hare Airport (it seems flying to/from O'Hare Airport on a Sunday is popular).
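The three filtered group-bys above can be collapsed into a single pivot table, with stations as rows and daytypes as columns. A sketch on a small synthetic frame with the same column names (the numbers are invented):

```python
import pandas as pd

# Synthetic frame mimicking the CTA columns; values are made up.
demo = pd.DataFrame({
    'stationname': ['Clark/Lake', 'Clark/Lake', 'Clark/Lake',
                    "O'Hare Airport", "O'Hare Airport", "O'Hare Airport"],
    'daytype': ['W', 'A', 'U', 'W', 'A', 'U'],
    'rides': [15000, 7000, 5000, 9000, 8000, 10000],
})

# One pivot table instead of three filtered group-bys.
avg_by_daytype = demo.pivot_table(index='stationname', columns='daytype',
                                  values='rides', aggfunc='mean')
print(avg_by_daytype)

# The top station per daytype falls out with idxmax on each column.
print(avg_by_daytype.idxmax())
```

On the real data this gives the same three "top station" answers in one step.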
What’s the standard deviation for the Washington/Wabash stop? What’s your hypothesis for why?
'''Selecting only washington/wabash'''
washington = ridership[ridership.stationname =='Washington/Wabash'].reset_index()
print(washington)
'''A brief overview of washington/wabash'''
print(washington.describe())
'''Taking a quick look at the std of the rest of the stations'''
ridership_std = ridership.groupby(['stationname', 'daytype']).rides.std().sort_values(ascending = False).reset_index()
print(ridership_std)
print(ridership_std[ridership_std.stationname == 'Washington/Wabash'])
sns.violinplot(data = washington, x = 'rides', y = 'daytype')
plt.title('W=Weekday, A=Saturday, U=Sunday/Holiday')
plt.show()
plt.clf()
sns.boxplot(data=washington, x='rides', y='daytype')
plt.title('W=Weekday, A=Saturday, U=Sunday/Holiday')
plt.show()
The std of Washington/Wabash rides is 3422.595650 overall: 2731.567544 on weekdays, 1886.062283 on Saturdays and 1515.673957 on Sundays.
However, I believe Washington/Wabash would have a smaller standard deviation if we eliminated the zero-ride outliers from before August 2017, when the station was still under construction.
Stepping away from std for a moment, it is interesting that Washington/Wabash ranks 11th in average number of rides, with its highest traffic on weekdays. This may point to Washington/Wabash being a station in an area people commute into for work. Alternatively, it could be a residential area with high traffic because the station's five lines are convenient for commuting to work. The surrounding area looks nice enough, and even has a park according to Google Maps, which gives people less incentive to venture away during the weekend.
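The outlier claim above is easy to sanity-check: dropping zero-ride days should shrink the spread considerably. A sketch on a synthetic series standing in for the daily rides column (the numbers are invented):

```python
import pandas as pd

# Made-up Washington/Wabash-style series: mostly ~10k rides,
# plus pre-opening days recorded as 0.
rides = pd.Series([0, 0, 0, 9500, 10200, 9800, 10500, 9900, 10100])

print(rides.std())             # inflated by the zero-ride days
print(rides[rides > 0].std())  # much smaller once the zeros are dropped
```

On the real data the same filter (`washington[washington.rides > 0]`) would test whether the pre-opening zeros are driving the std.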
Please choose a specific business and tell us which business you chose; any kind of business will do. Imagine you’re helping that business owner in Chicago and s/he is looking to open a new location. In the form of writing, potentially supplemented by sketches (computer-drawn or hand-drawn) and links, we want to see your response to these questions:
Furthermore, we want to see the results of 2–3 hours of work, using the real data, towards making those ideas a reality. The results could include findings from the data, code, Python/R notebooks, a visualization, results of a statistical model you built, etc. Try not to hide things or throw them away— we want to see your work!
I think with the data, we can explore:
- Any up and coming locations from an increase of rides
- The top popular stations in average rides
- The number of ramen competitors around select train stations
The first two questions can be explored easily with our current data set, but the number of competitors around train stations is much more uncertain. I believe I will have to go to a review website like Yelp to gather that data.
I think it would make an interesting visualisation if I managed to mark all of the up-and-coming train stations, the popular train stations, and the competing ramen restaurants on one map. Such a map could be used to predict future popular areas and mark out any blue-ocean areas for the client to open their new store!
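The "competitors around a station" count from the plan above can be computed with a haversine distance once both sets of coordinates are in hand. A minimal sketch with made-up coordinates (the 1 km radius is an arbitrary choice):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, long) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Made-up coordinates: one station and three candidate competitor locations.
station = (41.8837, -87.6278)  # roughly the Loop
competitors = [(41.8850, -87.6300),   # a few blocks away
               (41.9484, -87.6553),   # several km north
               (41.8840, -87.6280)]   # next door

nearby = sum(haversine_km(*station, lat, lon) <= 1.0 for lat, lon in competitors)
print(nearby)  # competitors within 1 km of the station
```

Run over every (station, restaurant) pair, this gives a competitor count per station to put next to the ridership numbers.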
'''I'm importing a new data set which is a monthly version of the original data set given for this assessment
because my computer can't handle the large amount of data to create facetgrid graphs.
Data set link: https://data.cityofchicago.org/Transportation/CTA-Ridership-L-Station-Entries-Monthly-Day-Type-A/t2rn-p8d7'''
ridership_m = pd.read_csv('CTA_-_Ridership_-__L__Station_Entries_-_Monthly_Day-Type_Averages___Totals.csv')
'''Parsing the date column as a real datetime so the values sort chronologically'''
ridership_m['month_beginning'] = pd.to_datetime(ridership_m.month_beginning)
print(ridership_m.head())
'''Note: 'stationame' (with one n) is the column name used by the monthly data set'''
ridership_mg = ridership_m.sort_values(['stationame', 'month_beginning'])
print(ridership_mg.head())
print(ridership_mg.dtypes)
'''With a datetime column, year and month come straight from the .dt accessor'''
ridership_mg['year'] = ridership_mg.month_beginning.dt.year
ridership_mg['month'] = ridership_mg.month_beginning.dt.month
ridership_year = ridership_mg.groupby(['stationame', 'year', 'month']).monthtotal.sum().reset_index()
print(ridership_year)
g = sns.lmplot(x='year', y='monthtotal', col='stationame',
               data=ridership_year, truncate=True, col_wrap=5)
g.set(xticks=np.arange(2001, 2018, 4))
plt.show()
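Rather than eyeballing 140-odd regression panels, the per-station trend can also be computed directly as the slope of a least-squares fit. A sketch on a synthetic frame (note the monthly data set really does spell its column 'stationame'; the numbers here are invented):

```python
import numpy as np
import pandas as pd

# Synthetic yearly totals for two stations: one rising, one flat.
demo = pd.DataFrame({
    'stationame': ['Logan Square'] * 5 + ['Harrison'] * 5,
    'year': list(range(2013, 2018)) * 2,
    'monthtotal': [100, 120, 140, 160, 180,   # steady growth
                   90, 91, 89, 90, 90],        # flat
})

# Slope of a least-squares line per station: positive = rising ridership.
slopes = demo.groupby('stationame').apply(
    lambda g: np.polyfit(g['year'], g['monthtotal'], 1)[0]
)
print(slopes.sort_values(ascending=False))
```

Sorting the slopes would give a ranked "rising stations" list instead of a visual judgment call.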
The stations that show a rising trend in ridership are: Addison-North Main, Belmont North Main, California/Milwaukee, Cermak-Chinatown, Chicago/Franklin, Chicago/Milwaukee, Chicago/State, Clark/Lake, Clinton/Lake, Damen/Milwaukee, Diversey, Division/Milwaukee, Fullerton, Grand/Milwaukee, Grand/State, Harrison, Lake/State, Library, Logan Square, Merchandise Mart, Monroe/Dearborn, Monroe/State, North/Clybourn, O'Hare Airport, Roosevelt, State/Lake, Washington/Dearborn, Washington/Milwaukee
While the stations with more than 150,000 riders as the average are:
95th/Dan Ryan, Adams/Wabash, Addison-North Main, Belmont-North Main, California/Milwaukee, Chicago/Franklin, Chicago/State, Clark/Division, Clark/Lake, Damen/Milwaukee, Fullerton, Grand/State, Jackson/Dearborn, Jackson/State, Jefferson/Park, Lake/State, Logan Square, Loyola, Merchandise Mart, Midway Airport, Monroe/Dearborn, Monroe/State, O'Hare Airport, Quincy/Wells, Randolph/Wabash, Roosevelt, Rosemont, Sheridan, State/Lake, UC-Halsted, Washington/Dearborn, Washington/Wells, Washington/Milwaukee, Wilson
My first big problem! I don't know how to web-scrape with BeautifulSoup yet. Luckily, after googling, I found out there are web-scraping programs like ParseHub, which I will be using for this.
#ramen = pd.read_csv('ramen3.csv')
#ramen = ramen[ramen.restaurant_name_address.notnull()]
#print(ramen.restaurant_address_state)
'''My second big problem is now realising that I can't input all of the addresses one by one into a latitude/longitude
converter! Google comes to the rescue again! I managed to find a library called geopy!
Link: https://geopy.readthedocs.io/en/latest/#'''
#from geopy.geocoders import Nominatim
#geolocator = Nominatim(user_agent="my-application", timeout=5)
#latlong = []
#for i in ramen.restaurant_address_state:
#    location = geolocator.geocode(i)
#    print((location.latitude, location.longitude))
#    latlong.append([location.latitude, location.longitude])
'''The code was giving me many problems, so unfortunately I had to do some things manually, such as correcting
the wrong addresses by cross-referencing Google Maps.
len(latlong) helped me find the index at which the code stopped.'''
#print(len(latlong))
'''Adding the latitude and longitude before saving it into a csv'''
#ramen['latlong'] = latlong
#ramen.to_csv('ramen_geo.csv')
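A more defensive version of that loop collects failures instead of crashing mid-list. The geocoder is injected as a function so the loop can be exercised without the network; the stand-in geocoder below is purely for illustration, and a real run would pass `Nominatim(...).geocode`:

```python
def geocode_all(addresses, geocode_fn):
    """Geocode each address, collecting failures instead of stopping mid-list."""
    latlong, failed = [], []
    for addr in addresses:
        try:
            loc = geocode_fn(addr)
            if loc is None:  # geopy returns None when there is no match
                raise ValueError('no match')
            latlong.append([loc.latitude, loc.longitude])
        except Exception:
            failed.append(addr)
    return latlong, failed

# Stand-in geocoder for demonstration only.
class FakeLoc:
    def __init__(self, lat, long_):
        self.latitude, self.longitude = lat, long_

fake = {'1 W Lake St': FakeLoc(41.885, -87.628)}
coords, failed = geocode_all(['1 W Lake St', 'nowhere'], lambda a: fake.get(a))
print(coords, failed)
```

With this shape, the `failed` list replaces the `len(latlong)` trick for finding where things went wrong.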
"""I was surfing youtube and came across this video: https://youtu.be/Yd5oEIBFQ_E
In the video, Halfdan Rump talks about a python library called folium which fitted perfectly with what I wanted to do
The codes below are to combine the two given data sets on their MAP_ID/station_id so each station the data set will
be complete with each station's location as well as ward number"""
"""Dataset link: https://data.cityofchicago.org/Transportation/CTA-System-Information-List-of-L-Stops/8pix-ypme"""
station = pd.read_csv('CTA_-_System_Information_-_List_of__L__Stops.csv')
station2 = station[['STATION_NAME', 'MAP_ID', 'Location', 'Zip Codes', 'Wards']]
'''Each station appears once per stop in the stop list, so collapse to one row per station'''
station_ward = station2.drop_duplicates().reset_index(drop=True)
ridership_mean = ridership.groupby(['station_id', 'stationname']).rides.mean().reset_index()
merge = pd.merge(ridership_mean, station_ward, left_on = 'station_id', right_on='MAP_ID', how = 'outer')
#merge2 = merge[merge.isnull().any(axis=1)]
"""Changing wards from float to string because folium takes a string input for the database to match with the geojson"""
merge = merge[merge['Wards'].notnull()]
merge.Wards = merge.Wards.astype(int)
merge.Wards = merge.Wards.astype(str)
#print(merge)
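A convenient way to find the rows that failed to match (what the commented-out merge2 null check above was doing) is the `indicator` flag of `pd.merge`. A sketch with tiny synthetic frames standing in for the two data sets:

```python
import pandas as pd

# Synthetic stand-ins for the ridership means and the stop list.
rides = pd.DataFrame({'station_id': [40380, 41700], 'rides': [12000.0, 9000.0]})
stops = pd.DataFrame({'MAP_ID': [40380, 40850], 'Wards': [42.0, 4.0]})

# indicator=True labels each row 'both', 'left_only' or 'right_only',
# making stations with no ward assignment easy to audit.
merged = pd.merge(rides, stops, left_on='station_id', right_on='MAP_ID',
                  how='outer', indicator=True)
print(merged[merged['_merge'] != 'both'])
```

On the real merge, filtering on `_merge != 'both'` lists exactly the stations that end up with no ward and darken the choropleth.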
"""exporting high and rising stations into a new csv"""
"""high = merge[merge['stationname'].isin(["95th/Dan Ryan", "Adams/Wabash", "Addison-North Main",
"Belmont-North Main", "California/Milwaukee", "Chicago/Franklin",
"Chicago/State", "Clark/Division", "Clark/Lake", "Damen/Milwaukee",
"Fullerton Grand/State", "Jackson/Dearborn", "Jackson/State", "Jefferson/Park",
"Lake/State", "Logan Square", "Loyola", "Merchandise Mart", "Midway Airport",
"Monroe/Dearborn", "Monroe/State", "O’Hare Airport", "Quincy/Wells", "Randolph/Wabash",
"Roosevelt", "Rosemont", "Sheridan", "State/Lake", "UC-Halsted", "Washington/Dearborn",
"Washington/Wells", "Washington/Milwaukee Wilson"])]
high.to_csv('station_high.csv')
rising = station[station['STATION_NAME'].isin(["Addison-North Main", "Belmont North Main",
"California/Milwaukee","Cermak-Chinatown", "Chicago/Franklin", "Chicago/Milwaukee",
"Chicago/State", "Clark/Lake", "Clinton/Lake", "Damen/Milwaukee", "Diversey",
"Division/Milwaukee", "Fullerton", "Grand/Milwaukee", "Grand/State", "Harrison",
"Lake/State", "Library", "Logan Square", "Merchandise Mart", "Monroe/Dearborn",
"Monroe/State", "North/Clybourn", "O’Hare Airport", "Roosevelt", "State/Lake",
"Washington/Dearborn", "Washington/Milwaukee"])]
rising.to_csv('station_rising.csv')"""
'''The Location/latlong columns hold "(lat, long)"-style strings, so strip the
brackets and split them into numeric lat/long columns'''
def parse_latlong(series):
    parts = series.astype(str).str.strip('()[]').str.split(',', expand=True)
    return pd.to_numeric(parts[0]), pd.to_numeric(parts[1])
high = pd.read_csv("station_high.csv")
high['lat'], high['long'] = parse_latlong(high.Location)
rise = pd.read_csv("station_rising.csv")
rise['lat'], rise['long'] = parse_latlong(rise.Location)
ramen = pd.read_csv("ramen_geo.csv")
ramen['lat'], ramen['long'] = parse_latlong(ramen.latlong)
import folium
from folium.plugins import MarkerCluster
"""Geojson data link here: https://www.chicago.gov/city/en/depts/doit/dataset/boundaries_-_wards.html"""
m = folium.Map(location=[41.881832, -87.623177],zoom_start = 10, tiles = 'cartodbpositron')
folium.Choropleth(
geo_data="Boundaries Wards.geojson",
name='choropleth',
data=merge,
columns=['Wards', 'rides'],
key_on='feature.properties.ward',
fill_color='BuPu',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Average number of rides per Ward in Chicago'
).add_to(m)
marker_cluster = MarkerCluster(name='Stations with high average rides').add_to(m)
marker_cluster2 = MarkerCluster(name='Stations that are rising').add_to(m)
marker_cluster3 = MarkerCluster(name='Other ramen stores in Chicago').add_to(m)
folium.LayerControl().add_to(m)
for row in high.itertuples():
    folium.Marker([row.lat, row.long], popup=row.STATION_NAME, icon=folium.Icon(color='blue')).add_to(marker_cluster)
for row in rise.itertuples():
    folium.Marker([row.lat, row.long], popup=row.STATION_NAME, icon=folium.Icon(color='green')).add_to(marker_cluster2)
for row in ramen.itertuples():
    folium.Marker([row.lat, row.long], popup=row.restaurant_name_name, icon=folium.Icon(color='lightgray')).add_to(marker_cluster3)
"""Exporting the file because it's too big to be rendered here"""
m.save('Choropleth of Chicago no of rides per ward.html')
"""It is rather unfortunate that within the data set, there are multiple stations which did not have any wards assigned
which is why there are multiple areas that are darkened, and practically all of our popular stations are situated within
those areas.
There are 2 wards which have high average number of rides per ward and 3 that comes in second. Cermak-Chinatown
station just so happens to be in one of these 2nd lesser purple wards."""
from IPython.display import IFrame
IFrame(src='Choropleth of Chicago no of rides per ward.html', width=1000, height=650)
Putting it all together, we now have a map of Chicago's 'L' train stations (blue trains), ramen stores from Yelp (green noodle bowls), high-average train stations (red hearts) and up-and-coming train stations (yellow stars).
The choropleth may be a little cluttered, so here's a Google Maps link:
https://drive.google.com/open?id=1mVyUlsKUbV9kNtMLWH67CX9xnBvf_Qli&usp=sharing
We can't really see what's going on when it's so zoomed out, so let's take a closer look.
The N Milwaukee Ave corridor seems to be a rather popular stretch: stations from Grand/Milwaukee to Logan Square have a high average, which probably explains the large number of other ramen stores along it.
The main Chicago area is filled with train stations with high ridership, but the rent there will most likely be high to match the upscale area. At the same time, multiple ramen stores already exist in this area as competitors.
Other areas with high-ridership train stations and no ramen stores are upper Chicago and Roosevelt station.
The area with the highest number of high-traffic stations that is also seeing a rise in ridership is upper Chicago.